Abstract — We consider a decentralized cognitive radio network
in which autonomous secondary users seek spectrum opportunities
in licensed spectrum bands. We assume that the primary
users’  channel  occupancy  follows  a  Markovian  evolution,  and 
formulate  the  spectrum  sensing  problem  as  a  Decentralized 
Partially  Observable  Markov  Decision  Process  (DEC-POMDP). 
We  develop  a  distributed  Reinforcement  Learning  (RL)  algorithm 
that  allows  each  autonomous  cognitive  radio  to  distributively 
learn  its  own  spectrum  sensing  policy.  The  resulting  decentralized 
sensing  policy  enables  secondary  users  to  non-cooperatively  reach 
an  equilibrium  that  leads  to  high  utilization  of  idle  channels 
while  minimizing  the  collisions  among  secondary  cognitive  radios. 
Moreover,  we  propose  a  decentralized  channel  access  policy  that 
permits  controlling,  with  high  accuracy,  the  collision  probability 
with  primary  users.  Our  numerical  results  validate  the  robustness 
of  this  collision  probability  control  as  the  sensing  noise  changes. 
They  also  show  the  efficiency  of  the  proposed  learning  algorithm 
in  improving  the  spectrum  utilization. 

Index  Terms — Cognitive  radio,  reinforcement  learning. 

I.  Introduction 

Opportunistic Spectrum Access (OSA) [1] has been envisioned as a promising technique for exploiting spectrum vacancies: it permits unlicensed secondary users to access the primary channels opportunistically when the primary users, who own the spectrum rights, are not transmitting. Cognitive Radio (CR) devices have been proposed as a platform to realize such OSA techniques. In general, CRs are assumed to be able to sense and adapt to their Radio Frequency (RF) environment.

In  this  paper,  we  consider  a  decentralized  CR  network  in 
which  each  secondary  user  tries  to  obtain,  independently,  the 
best  estimate  of  the  status  of  the  primary  channels  based  on 
its  own  local  information.  In  particular,  when  the  primary 
channel  states  follow  a  Markovian  evolution,  a  cognitive  user 
can  utilize  its  history  of  observations  and  actions  in  order  to 
derive  a  better  sensing/accessing  policy.  This  problem  can  then 
be  formulated  as  a  Decentralized  Partially  Observable  Markov 
Decision  Process  (DEC-POMDP)  and  has  been  discussed 
in  several  recent  studies.  For  example,  in  [2],  the  authors 
suggested  a  Medium  Access  Control  (MAC)  protocol  for 
decentralized  ad-hoc  CR  networks  by  modeling  the  system 
as  a  POMDP  that  is  equivalent  to  a  Markov  Decision  Process 
(MDP) with an infinite number of states. The corresponding
optimal  sensing  policy  that  maximizes  the  total  discounted 
return  was  shown  to  be  computationally  prohibitive.  Thus,  an 
optimal  myopic  policy  was  derived  such  that  it  maximizes  the 
instantaneous  rewards.  The  myopic  policy  that  was  formulated 
in  [2]  is  optimal  for  a  single-user  setup,  and  is  suboptimal 
when  applied  to  a  multiuser  setting  because  it  would  lead 
to  collisions  between  secondary  users  when  more  than  one 
user tries to access the same channel. On the other hand, in
[3]  the  authors  proposed  three  different  sensing  policies  for 
multiuser  OSA:  The  first  policy  is  based  on  a  cooperative 
protocol  in  which  secondary  users  exchange  their  beliefs  about 
the  channel  states  at  each  time  slot.  The  second  policy  applies 
learning  techniques  to  obtain  an  estimate  of  the  other  users’ 
beliefs,  and  the  third  policy  is  based  on  a  single-user  approach 
in  which  the  cognitive  users  act  non-cooperatively.  We  note 
that  [3]  assumes  perfect  sensing  of  the  primary  channels, 
which  we  do  not  assume  throughout  this  paper. 

In  [4],  a  suboptimal  sensing/access  policy  was  derived  for 
cooperative  cognitive  networks  since  it  is  not  easy  to  solve  the 
Bellman  equation  that  corresponds  to  the  formulated  POMDP 
model.  However,  the  assumed  model  did  not  ensure  full 
utilization  of  spectrum  resources  because  only  one  primary 
channel  was  accessed  at  each  time  instant  collectively  by  all 
secondary  users.  This  leads  to  low  network  throughput  since 
all  the  secondary  users  are  assumed  to  sense  the  same  primary 
channel  at  a  time.  The  main  advantage  of  this  model,  however, 
was  that  it  achieves  better  sensing  performance.  The  trade-off 
between  the  sensing  accuracy  and  the  secondary  throughput 
has  been  discussed  recently  in  [5]. 

We believe that the solution to these issues is to make the so-called CRs truly cognitive: to achieve smart performance, CRs should be able to learn from their observed environment and their past actions. Indeed, it can be argued that learning from experience must be at the heart of any cognitive system. This view has recently been gaining importance within the CR research community, as is evident from the application of learning techniques to CRs. For example, the multi-agent Reinforcement Learning (RL) algorithm known as Q-learning was applied in [6] to achieve interference control in decentralized Wireless Regional Area Networks (WRAN). In [7], the authors developed a Q-learning algorithm for an auction-based dynamic spectrum access protocol, which is different from the DEC-POMDP structure of our proposed model. To the best of our knowledge, none of the CR studies that assume an underlying POMDP structure [2]-[4] has used the Q-learning algorithm to solve the OSA problem. The literature on learning techniques to achieve CR goals is still in its infancy, although there is a rich literature on machine learning in computer science and classical statistical learning that provides a great starting point [8].

In this paper, we formulate channel sensing in decentralized cognitive networks as a DEC-POMDP problem. Unlike [2], our approach considers a multi-user setting, and we propose a channel sensing policy that takes into account the collisions among secondary users. Our proposed sensing policy is based on distributed RL. Note that we use RL to derive the sensing policy rather than to obtain interference control as in [6]. This algorithm achieves two main goals:
Deriving  a  sensing  policy  based  on  the  history  of  actions  and 
observations,  and  minimizing  the  collisions  between  secondary 
users  while  competing  for  channel  access  opportunities.  On  the 
other  hand,  we  propose  a  channel  access  mechanism  that  limits 
the  collisions  between  primary  and  secondary  users  when 
secondary  users  have  noisy  observations  about  the  primary 
channels.  Our  channel  access  scheme  ensures  high  accuracy 
and  robustness  in  controlling  the  collision  probability  with 
primary  channels,  thus  guaranteeing  the  Quality  of  Service 
(QoS)  requirements  of  primary  users. 

The  remainder  of  this  paper  is  organized  as  follows:  Section 
II  defines  the  system  model.  In  sections  III  and  IV,  we  derive 
both  the  accessing  and  sensing  policies  for  cognitive  users.  We 
show  the  simulation  results  in  section  V.  Section  VI  concludes 
the  paper. 


II.  System  Model 

We consider a wireless network having a set of primary channels $\mathcal{C} = \{1, \ldots, L\}$. The channels' occupancy states are assumed to be independent and to follow a Markovian evolution. A set of distributed users forms a secondary network that is assumed to rely on cognitive techniques to access these primary channels when they are idle. The set of secondary users in the system is denoted by $\mathcal{K}_s = \{1, \ldots, K_s\}$. The secondary network forms a multiple access channel in which each secondary user independently searches for a spectrum opportunity in order to communicate with a secondary base station, as depicted in Fig. 1. Every secondary user $j \in \mathcal{K}_s$ is assumed to be able to sense only one primary channel at a time, and we assume that secondary users do not cooperate. This is a reasonable assumption in decentralized networks in which there are no control channels for ensuring collaboration among secondary users.

Fig. 1. Cognitive Radio Network (CRN) with distributed secondary nodes

We identify the overall system made of the primary channels and the $K_s$ secondary users as a DEC-POMDP [9] by defining the state of the system as $\mathbf{s}(k) = (s_1(k), \ldots, s_L(k)) \in \mathcal{S}$, where $s_i(k) \in \{0,1\}$ represents the state of channel $i \in \mathcal{C}$ as being idle (0) or busy (1) in time slot $k$, and $\mathcal{S}$ is the set of all possible states $\mathbf{s}(k)$. We define $\mathbf{a} = (a_1, \ldots, a_{K_s})$ as the joint action of all secondary users (agents) and $P(\mathbf{s}, \mathbf{a}, \mathbf{s}')$ to be the probability of transition from state $\mathbf{s}$ to $\mathbf{s}'$ when taking the joint action $\mathbf{a}$. The transitions of every channel's state are independent of the other states, and these transitions are assumed to follow a Markovian evolution as mentioned above. The state transition matrix $\mathbf{P}$ of the state vector $\mathbf{s}(k)$ is therefore $\mathbf{P} = \mathbf{P}_1 \otimes \cdots \otimes \mathbf{P}_L$, where $\mathbf{P}_i$ is the state transition matrix of channel $i$, and $\otimes$ denotes the Kronecker product.
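As an illustration of the Kronecker structure of the joint transition matrix, the following minimal sketch builds it for a small number of channels. It assumes NumPy; the function name `joint_transition_matrix`, the choice of three channels, and the ordering of joint states induced by `np.kron` are illustrative assumptions rather than part of the model. The per-channel matrix is the one used in the simulation section.

```python
from functools import reduce

import numpy as np


def joint_transition_matrix(channel_matrices):
    """Kronecker product P_1 (x) ... (x) P_L of per-channel 2x2 matrices.

    Hypothetical helper: joint states are ordered as induced by np.kron,
    i.e. the first channel is the most significant "digit".
    """
    return reduce(np.kron, channel_matrices)


# Per-channel transition matrix (rows: current state, columns: next state).
P_i = np.array([[0.9, 0.1],
                [0.2, 0.8]])

# Assumed example size: L = 3 independent channels -> 2^3 = 8 joint states.
P = joint_transition_matrix([P_i, P_i, P_i])
print(P.shape)
print(P.sum(axis=1))  # every row of a transition matrix sums to one
```

Because the joint matrix has $2^L \times 2^L$ entries, forming it explicitly is only practical for small $L$; the per-channel beliefs can instead be propagated independently.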

Note that the transition probabilities $P(\mathbf{s}, \mathbf{a}, \mathbf{s}')$ (for $(\mathbf{s}, \mathbf{s}') \in \mathcal{S}^2$) are independent of the secondary users' actions since they are determined by the evolution of the primary channels' states, i.e. $P(\mathbf{s}, \mathbf{a}, \mathbf{s}') = P(\mathbf{s}, \mathbf{s}')$, where $P(\mathbf{s}, \mathbf{s}')$ is obtained from the state transition matrix $\mathbf{P}$. Similarly, for an individual channel $i \in \mathcal{C}$, the transition probabilities $P_i(l, l')$ (for $(l, l') \in \{0,1\}^2$) are obtained from $\mathbf{P}_i$.

The action of secondary user $j \in \mathcal{K}_s$ at time $k$ is denoted by $a_j(k) \in \mathcal{C}$, which represents the index of the primary channel that user $j$ should sense during time slot $k$. We define $Y_i(k,j)$ to be the observation of secondary user $j \in \mathcal{K}_s$ on channel $i \in \mathcal{C}$ in time slot $k$, which is assumed to be the output of a Binary Symmetric Channel (BSC) where $\Pr\{Y_i(k,j) \neq s_i(k)\} = \nu_i$ is the crossover probability. As a result, $Y_i(k,j)$ is a discrete random variable with distinct probability mass functions (pmf) $f_0$ and $f_1$ when $s_i(k) = 0$ and $s_i(k) = 1$, respectively.

Let $\mathbf{Y}_i^k(j)$ denote the vector of observations up to time slot $k$ obtained by secondary user $j \in \mathcal{K}_s$ on channel $i \in \mathcal{C}$. Let $\mathcal{K}_i^k(j)$ denote the time slot indices up to slot $k$ when channel $i$ was sensed by secondary user $j$. Also, let $\mathbf{Y}^k(j) = \{\mathbf{Y}_i^k(j) : i \in \mathcal{C}\}$ be the collection of observations up to slot $k$ on all primary channels obtained by the $j$-th secondary user.

III.  Channel  Access  Mechanism 

The sensing and access operations of the secondary users are scheduled as shown in Fig. 2, where a secondary user senses a primary channel during the sensing period $\tau$. Primary users are assumed to always start their transmission at the beginning of a frame of duration $T_f$, so that a primary channel will remain free during the secondary access duration if it was free during the corresponding sensing period.

Fig. 2. Channel Access Policies

A cognitive device that has sensed a channel can access that channel during the remaining frame duration of $T_f - \tau$. In order to avoid collisions among secondary users, we assume that each secondary user generates a random backoff time before transmitting [2]. If more than one secondary user decides to access the same channel, the channel access is granted to the secondary user that has the smallest backoff time.

After sensing channel $i = a_j(k)$, secondary user $j \in \mathcal{K}_s$ decides whether to access channel $i$ based on its observation sequence $\mathbf{y}_i^k(j) = \{y_i(k',j) : k' \in \mathcal{K}_i^k(j)\}$, where $y_i(k',j)$ is a realization of $Y_i(k',j)$. In order to achieve a probability of collision below a certain bound, we may apply a Neyman-Pearson (NP) type detector [10]. An optimal access decision for the $j$-th secondary user would choose one of the two possible hypotheses $H_1 = \{s_i(k) = 0\}$ or $H_0 = \{s_i(k) = 1\}$ in time slot $k$ based on the whole observation sequence $\mathbf{y}_i^k(j)$. However, implementing such an optimal detector becomes too complicated because it requires computing the distribution of the likelihood ratio of $\mathbf{Y}_i^k(j)$, which is a random sequence whose length increases linearly with time. Hence, we simplify the detection rule by assuming that the decision to access a channel in time slot $k$ is based only on the current observation.

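Let $\alpha$ be the false alarm probability such that $\alpha < 0.5$. The optimal NP detector is then a randomized access decision rule $\delta_i(k,j)$ for secondary user $j$ to access channel $i$ at time $k$. This access decision can be viewed as a Bernoulli (binary) random variable $\delta_i(k,j) \in \{0,1\}$ whose parameter $\bar{\delta}_i(k,j)$ is given by:

$$
\bar{\delta}_i(k,j) =
\begin{cases}
\dfrac{\alpha}{\nu_i}\, \mathbb{1}_{\{y_i(k,j)=0\}}\, I_{i,j}^{(k)} & \text{if } \alpha < \nu_i \\[2mm]
\left[ \mathbb{1}_{\{y_i(k,j)=0\}} + \dfrac{\alpha-\nu_i}{1-\nu_i}\, \mathbb{1}_{\{y_i(k,j)=1\}} \right] I_{i,j}^{(k)} & \text{if } \alpha \geq \nu_i
\end{cases}
$$

where $I_{i,j}^{(k)} = \mathbb{1}_{\{a_j(k)=i\}}$, and $\mathbb{1}_B = 1$ if condition $B$ is satisfied, and 0 otherwise. Therefore, secondary user $j$ decides to access a sensed channel $i$ in time slot $k$ only if $\delta_i(k,j) = 1$, which happens with probability $\bar{\delta}_i(k,j)$.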
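The randomized access rule can be checked numerically. The sketch below assumes the standard randomized Neyman-Pearson form for a single BSC observation with crossover probability $\nu_i < 0.5$: when $\alpha < \nu_i$, access with probability $\alpha/\nu_i$ only on an "idle" observation; otherwise always access on an "idle" observation and access with probability $(\alpha - \nu_i)/(1 - \nu_i)$ on a "busy" observation. The function names are hypothetical and the numerical values are illustrative.

```python
def access_probability(y, alpha, nu):
    """Probability of accessing the sensed channel given observation y in {0, 1}.

    alpha: target per-user false alarm probability (assumed < 0.5).
    nu:    BSC crossover probability (assumed 0 < nu < 0.5).
    """
    if alpha < nu:
        return alpha / nu if y == 0 else 0.0
    return 1.0 if y == 0 else (alpha - nu) / (1.0 - nu)


def false_alarm(alpha, nu):
    """Probability of accessing a busy channel under the rule above.

    When the channel is busy (state 1), the BSC reports y = 0 with
    probability nu and y = 1 with probability 1 - nu.
    """
    return (nu * access_probability(0, alpha, nu)
            + (1.0 - nu) * access_probability(1, alpha, nu))


# Both regimes should reproduce the target false alarm probability.
for alpha, nu in [(0.05, 0.1), (0.2, 0.1)]:
    print(alpha, nu, false_alarm(alpha, nu))
```

In both regimes the realized false alarm probability equals the target $\alpha$, which is what allows the collision probability with primary users to be controlled independently of the sensing noise.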

It can be observed that the collision probability on a particular channel can exceed the desired threshold because the accessing rule in a decentralized network follows an OR-rule: a collision occurs as soon as any one of the users sensing a busy channel decides to access it. For that reason, we will design a channel access mechanism that guarantees a certain collision probability with the primary channels.

We define $E_{j,i}(k)$ to be the event that secondary user $j \in \mathcal{K}_s$ decides to access channel $i \in \mathcal{C}$ at time $k$, given that secondary user $j$ has sensed channel $i$ at time $k$. Also, we let $E_i(k)$ be the event that channel $i \in \mathcal{C}$ is busy at time $k$. When several secondary users sense and try to access the same primary channel $i \in \mathcal{C}$, we define the resulting collision probability as $P_c(i) = \Pr\left\{ \bigcup_{j \in \mathcal{Z}_i(k)} E_{j,i}(k) \,\middle|\, E_i(k) \right\}$, where $\mathcal{Z}_i(k)$ is the set of secondary users that sense channel $i$ in time slot $k$.

Note that the events $\{E_{j,i}(k) \mid E_i(k) : j \in \mathcal{K}_s\}$ are independent because each secondary user makes its access decision independently of the other users, after having sensed channel $i$. As a result, the collision probability on channel $i$ can be expressed as $P_c(i) = 1 - (1-\alpha)^{Z_i(k)}$, where $Z_i(k) = |\mathcal{Z}_i(k)|$ and $\alpha = \Pr\{E_{j,i}(k) \mid E_i(k)\}$ is the false alarm probability of each secondary detector that results from claiming $H_1 = \{s_i(k) = 0\}$ (or equivalently $\{\delta_i(k,j) = 1\}$) when $H_0 = \{s_i(k) = 1\}$ is true. Therefore, in order to ensure an overall collision probability $P_c(i) = \alpha_0$ in channel $i$, each secondary user $j \in \mathcal{Z}_i(k)$ should set its false alarm probability to $\alpha = 1 - (1-\alpha_0)^{1/Z_i(k)}$.

Since each secondary user does not know the total number of users $Z_i(k)$ that are sensing primary channel $i \in \mathcal{C}$ at a particular time $k$, it uses the expected value of $Z_i(k)$ to compute its false alarm probability such that $\alpha = 1 - (1-\alpha_0)^{1/E\{Z_i(k)\}}$. We will compute this expected value in the following and show, through simulations, that the proposed access technique can guarantee an upper bound on the collision probability between primary and secondary users.
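The relation between the per-user false alarm probability and the overall collision probability can be illustrated with a short sketch. It assumes independent access decisions, so that $P_c = 1 - (1-\alpha)^{Z}$; the function names and the target value $\alpha_0 = 0.1$ are illustrative assumptions.

```python
def collision_probability(alpha, z):
    """P_c = 1 - (1 - alpha)^z for z independent users, each with false alarm alpha."""
    return 1.0 - (1.0 - alpha) ** z


def per_user_false_alarm(alpha0, z):
    """alpha = 1 - (1 - alpha0)^(1/z); z may be an expected (non-integer) count."""
    return 1.0 - (1.0 - alpha0) ** (1.0 / z)


alpha0 = 0.1  # assumed target collision probability
for z in (1, 2, 3):
    alpha = per_user_false_alarm(alpha0, z)
    # Plugging alpha back in recovers the target alpha0 when z is exact.
    print(z, alpha, collision_probability(alpha, z))

# Without the correction, three users each using alpha0 exceed the target.
print(collision_probability(alpha0, 3))
```

When the true number of sensing users exceeds the value used to compute $\alpha$, the realized collision probability rises above $\alpha_0$, which is why the estimate of $Z_i(k)$ matters.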

IV. Sensing Policies of Distributed Secondary Users

We define the belief vector of channel $i \in \mathcal{C}$ as $\mathbf{p}(k,j,i) = [p_0(k,j,i), p_1(k,j,i)]$, where $p_l(k,j,i) = \Pr\{s_i(k) = l \mid \mathbf{Y}_i^{k-1}(j)\}$ represents the probability of $s_i(k)$ being in state $l \in \{0,1\}$ in time slot $k$, given the past observations $\mathbf{Y}_i^{k-1}(j)$. Let $\mathbf{b}_j(k) = [b_j(1,k), \ldots, b_j(2^L,k)]$ be the belief vector of the primary system according to secondary user $j$, where

$$ b_j(u(\mathbf{s}(k)), k) = \prod_{i=1}^{L} p_{s_i(k)}(k,j,i), \tag{1} $$

given that $u(\mathbf{s}) \in \mathcal{U} = \{1, \ldots, 2^L\}$ is the index of state $\mathbf{s}(k) = (s_1(k), \ldots, s_L(k))$. The belief vector $\mathbf{b}_j(k)$ is a sufficient statistic for an optimal OSA protocol in a single-user setup [2]. However, in our case, we consider a distributed multi-user scenario and $\mathbf{b}_j(k)$ is no longer a sufficient statistic for optimal decisions. But since we are interested in applying RL techniques to solve the DEC-POMDP problem, we may still use the belief vector $\mathbf{b}_j(k)$ to obtain a reasonably good suboptimal solution in a distributed multi-user setting, as shown in [6]. This simplifies the problem while still leading to near-optimal solutions.
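The product form of the system belief in (1) can be sketched as follows. The state indexing $u(\mathbf{s})$ is not specified in the text, so the lexicographic ordering of state tuples used here (and the zero-based indexing) is an assumption, as is the function name `joint_belief`.

```python
from itertools import product

import numpy as np


def joint_belief(channel_beliefs):
    """Belief over the 2^L joint states from per-channel beliefs [p_0, p_1].

    Assumed indexing: states (s_1, ..., s_L) in lexicographic order,
    so index 0 is the all-idle state (0, ..., 0).
    """
    n_channels = len(channel_beliefs)
    return np.array([
        np.prod([channel_beliefs[i][s[i]] for i in range(n_channels)])
        for s in product((0, 1), repeat=n_channels)
    ])


# Illustrative per-channel beliefs for L = 2 channels.
b = joint_belief([[0.9, 0.1], [0.5, 0.5]])
print(b)        # one entry per joint state
print(b.sum())  # a valid belief sums to one
```

The factorization relies on the channels being independent, as assumed in the system model.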

At each time slot, each secondary user updates its belief vector about the states of the channels in the next slot. Suppose secondary user $j$ senses channel $i = a_j(k)$ in time slot $k$ and observes $Y_i(k,j)$. Then it updates its belief about the state of channel $i$ in time slot $k+1$ using Bayes' formula as follows:

$$ p_m(k+1,j,i) = \frac{\sum_{l=0}^{1} P_i(l,m)\, f_l(Y_i(k,j))\, p_l(k,j,i)}{\sum_{l=0}^{1} f_l(Y_i(k,j))\, p_l(k,j,i)}, \tag{2} $$

where $m \in \{0,1\}$. For the unsensed primary channels $i' \neq a_j(k)$, the $j$-th secondary user's belief vector is simply updated based on the assumed Markovian evolution: $\mathbf{p}(k+1,j,i') = \mathbf{p}(k,j,i')\,\mathbf{P}_{i'}$, $\forall i' \neq a_j(k)$.

Fig. 3. Sensing and Updating the Beliefs

Figure 3 shows the update procedure, in which thick arrows represent the updates using Bayes' formula, whereas thin arrows represent the updating of beliefs based only on the assumed Markovian nature of the channels.
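The two kinds of belief update can be sketched in a few lines. The sketch assumes a BSC observation model with crossover probability `nu` and a Bayes correction followed by one Markov transition step for the sensed channel; the function names and the numerical values are illustrative assumptions.

```python
import numpy as np


def bsc_pmf(y, nu):
    """Return [f_0(y), f_1(y)]: likelihood of observing y when the state is 0 or 1."""
    return np.array([1.0 - nu, nu]) if y == 0 else np.array([nu, 1.0 - nu])


def update_sensed(p, y, nu, P_i):
    """Bayes correction with observation y, then one-step Markov prediction."""
    posterior = bsc_pmf(y, nu) * p
    posterior = posterior / posterior.sum()
    return posterior @ P_i


def update_unsensed(p, P_i):
    """Markov prediction only, for channels that were not sensed in this slot."""
    return p @ P_i


P_i = np.array([[0.9, 0.1],
                [0.2, 0.8]])
p = np.array([0.5, 0.5])  # assumed uninformative initial belief

print(update_sensed(p, y=0, nu=0.1, P_i=P_i))
print(update_unsensed(p, P_i))
```

An "idle" observation sharpens the belief toward the idle state before the transition matrix is applied, whereas an unsensed channel's belief simply drifts toward the stationary distribution of the chain.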


A.  The  Reward  and  Value  functions 

We define the total discounted return of user $j \in \mathcal{K}_s$ in time slot $k$ as $R_j(k) = \sum_{n=0}^{\infty} \gamma^n r_j(k+n)$, where $r_j(k)$ is the reward of secondary user $j$ in time slot $k$ and $\gamma \in (0,1)$ is a discounting factor. In a fully observable MDP, an agent $j \in \mathcal{K}_s$ may define the value of a state $\mathbf{s}$ in slot $k$ and under a policy $\pi_j$ as [8]:

$$ V^{\pi_j}(\mathbf{s}, k) = E\{R_j(k) \mid \mathbf{s}(k) = \mathbf{s}\}. \tag{3} $$

Similarly, the function $Q^{\pi_j}(\mathbf{s}, a, k)$ is defined as the expected return starting from state $\mathbf{s}$, taking the action $a$, and then following a policy $\pi_j$ thereafter:

$$ Q^{\pi_j}(\mathbf{s}, a, k) = E\{R_j(k) \mid \mathbf{s}(k) = \mathbf{s},\, a_j(k) = a\}. \tag{4} $$

In the case of a POMDP, however, the actual state of the system is the belief vector $\mathbf{b}_j(k)$. Hence, the resulting process is an infinite-state MDP, which makes the solutions of (3) and (4) computationally expensive. In particular, our assumed model of a DEC-POMDP is a non-cooperative multi-agent system whose solution is shown to be NEXP-hard [9]. Hence, we will solve this problem by finding the Q-values of the DEC-POMDP model by using the underlying MDP model [11], as explained in the next section.

B.  Reinforcement  Learning  for  DEC-POMDP 

In the following, we extend the Q-learning algorithm, which is defined for centralized fully observable environments in [8], to the partially observable channel sensing problem. This can be done by assigning a $Q(\mathbf{s}, a)$ table to each secondary user $j$, where $\mathbf{s} \in \mathcal{S}$ is the channels' state vector with $u(\mathbf{s}) \in \mathcal{U} = \{1, \ldots, 2^L\}$ being the index of state $\mathbf{s}$, and $a \in \mathcal{C}$ is the index of the sensed channel. However, we do not use the belief vector $\mathbf{b}_j(k)$ as the actual state. Instead, we solve for the values of $Q(\mathbf{s}, a)$ in the underlying MDP model by using $\mathbf{b}_j(k)$ as a weighting vector, as described in [11]. Although this is not the optimal solution of the DEC-POMDP problem, [11] shows that this approach leads to a near-optimal solution with very low computational complexity if the algorithm adopts an $\epsilon$-greedy policy [8].

Since the secondary users cannot fully observe the state of the primary system in the POMDP environment, the sensing policy of each secondary user is based on the belief vector $\mathbf{b}_j(k) = [b_j(1,k), \ldots, b_j(2^L,k)]$. We describe the Q-learning procedure for each user $j \in \mathcal{K}_s$ in Algorithm 1.

Algorithm 1 Q-learning Algorithm for agent $j \in \mathcal{K}_s$

for each $\mathbf{s} \in \mathcal{S}$, $a \in \mathcal{C}$ do
    Initialize $Q(\mathbf{s}, a) = 0$.
end for
Initialize the belief vector $\mathbf{b}$ arbitrarily.
for each time slot $k$ do
    Generate a random number rnd between 0 and 1.
    if rnd < $\epsilon$ then
        Select action $a^*$ randomly.
    else
        Select action $a^* = \arg\max_a Q_{\mathbf{b}}(a)$.
    end if
    Execute action $a^*$ (i.e. sense channel $a^*$).
    Receive the immediate reward $r_j(k)$.
    Update $p_0(k, j, a^*)$ using the observation $y(k)$:
        $p_0(k, j, a^*) \leftarrow \dfrac{f_0(y(k))\, p_0(k, j, a^*)}{\sum_{l=0}^{1} f_l(y(k))\, p_l(k, j, a^*)}$
    Update the current belief $\mathbf{b}$ according to $p_0(k, j, a^*)$.
    Evaluate the next belief vector $\mathbf{b}'$ based on (2).
    Update the table entries as follows:
        $Q(\mathbf{s}, a^*) \leftarrow Q(\mathbf{s}, a^*) + \Delta Q_{\mathbf{b}}(\mathbf{s}, a^*)$, $\forall \mathbf{s} \in \mathcal{S}$.
    $\mathbf{b} \leftarrow \mathbf{b}'$.
end for

Given a belief vector $\mathbf{b} = [b(1), \ldots, b(2^L)]$, we define the Q-value of the belief vector $\mathbf{b}$ as:


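$$ Q_{\mathbf{b}}(a) = \sum_{\mathbf{s} \in \mathcal{S}} b(u(\mathbf{s}))\, Q(\mathbf{s}, a), \tag{5} $$

and the update function as:

$$ \Delta Q_{\mathbf{b}}(\mathbf{s}, a) = \epsilon\, b(u(\mathbf{s})) \left[ r_j(k) + \gamma \max_{a' \in \mathcal{C}} Q_{\mathbf{b}'}(a') - Q(\mathbf{s}, a) \right]. $$

We define $\epsilon$ to be the learning rate. The Q-value $Q(\mathbf{s}, a)$ is updated after taking every action using:

$$ Q(\mathbf{s}, a) \leftarrow Q(\mathbf{s}, a) + \Delta Q_{\mathbf{b}}(\mathbf{s}, a). \tag{6} $$

This update is done for every state $\mathbf{s} \in \mathcal{S}$.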
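The belief-weighted Q-value and its update can be sketched compactly with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the table is stored as an array of shape (number of states, number of channels), the exploration probability and the learning rate are kept as separate hypothetical parameters (`explore_prob`, `learning_rate`), and all numerical values are illustrative.

```python
import numpy as np


def q_of_belief(Q, b):
    """Q_b(a) = sum_s b(s) Q(s, a); Q has shape (n_states, n_actions)."""
    return b @ Q


def select_action(Q, b, explore_prob, rng):
    """Epsilon-greedy choice of the channel to sense."""
    if rng.random() < explore_prob:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(q_of_belief(Q, b)))


def q_update(Q, b, b_next, action, reward, learning_rate, gamma):
    """Belief-weighted update of Q(s, action) for every state s (in place)."""
    target = reward + gamma * np.max(q_of_belief(Q, b_next))
    Q[:, action] += learning_rate * b * (target - Q[:, action])
    return Q


rng = np.random.default_rng(0)
n_states, n_actions = 4, 2           # assumed example: L = 2 channels
Q = np.zeros((n_states, n_actions))  # table initialized to zero
b = np.full(n_states, 0.25)          # assumed uniform belief

Q = q_update(Q, b, b, action=0, reward=1.0, learning_rate=0.5, gamma=0.9)
print(Q)
print(select_action(Q, b, explore_prob=0.0, rng=rng))
```

Each state's entry moves toward the same target in proportion to its belief weight, so states that are considered unlikely are barely modified by an observation.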


V.  Simulation  Results 


We assume that all primary channels $i \in \mathcal{C}$ have the same transition probabilities, which are governed by the transition matrix:

$$ \mathbf{P}_i = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}. \tag{7} $$

Fig. 4. Average Utilization of Primary channels for $\alpha_0 = 0.1$.

Fig. 5. Average Utilization of Primary channels for $K_s = 3$.


We define the average spectrum hole utilization as:

$$ U = \frac{\sum_{j=1}^{K_s} \sum_{k=1}^{\infty} \mathbb{1}_{\{r_j(k)=1\}}}{\sum_{i=1}^{L} \sum_{k=1}^{\infty} \mathbb{1}_{\{s_i(k)=0\}}}. \tag{8} $$


The reinforcement values (rewards) are selected as follows:

1) $r_j(k) = 1$ if secondary user $j$ successfully accesses channel $a_j(k)$ at time $k$.

2) $r_j(k) = -0.5$ if secondary user $j$ backs off due to a collision with another secondary user, conditioned on the channel being idle.

3) $r_j(k) = 0$ if the sensed channel is busy.

In the random sensing scenario, the average number of secondary users that are sensing a given primary channel is $E\{Z_i(k)\} = \frac{K_s}{L\left[1 - (1 - 1/L)^{K_s}\right]}$, where $Z_i(k) \in \{1, \ldots, K_s\}$ is a zero-truncated binomial random variable with parameters $K_s$ and $1/L$. Thus, in the random sensing scenario, we set the false alarm probability of each secondary user to $\alpha = 1 - (1-\alpha_0)^{1/E\{Z_i(k)\}}$.

On the other hand, when applying the Q-learning algorithm, the secondary users will be evenly distributed over the channels. Therefore, $E\{Z_i(k)\} = K_s/L$ if $K_s > L$, and $E\{Z_i(k)\} = 1$ otherwise.

We note that $E\{Z_i(k)\}$ is conditioned on channel $i$ being sensed (i.e. conditioned on $\{Z_i(k) \neq 0\}$).
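The mean of a zero-truncated binomial variable can be verified against a direct summation over its pmf. In the sketch below, the function names and the example values $K_s = 3$ and $L = 5$ are illustrative assumptions; the closed form is the standard one for a Binomial($K_s$, $1/L$) variable conditioned on being nonzero.

```python
from math import comb


def expected_sensors_random(Ks, L):
    """Closed-form mean of a zero-truncated Binomial(Ks, 1/L) variable."""
    q = 1.0 / L
    return Ks * q / (1.0 - (1.0 - q) ** Ks)


def expected_sensors_bruteforce(Ks, L):
    """Same mean, computed directly from the truncated pmf as a sanity check."""
    q = 1.0 / L
    pmf = [comb(Ks, z) * q ** z * (1.0 - q) ** (Ks - z) for z in range(Ks + 1)]
    return sum(z * pmf[z] for z in range(1, Ks + 1)) / (1.0 - pmf[0])


Ks, L = 3, 5  # assumed example sizes
print(expected_sensors_random(Ks, L))
print(expected_sensors_bruteforce(Ks, L))
```

Because the expectation is conditioned on the channel being sensed, it is always at least one, and it equals one exactly when there is a single secondary user.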

In the following simulations, we model the sensing observations of channel $i \in \mathcal{C}$ as the output of a BSC with crossover probability $\nu_i$, and we let $\boldsymbol{\nu} = [\nu_1, \ldots, \nu_L]$. The use of a BSC simplifies the analysis, yet it is applicable to different channel environments since $\nu_i$ can depend on the channel fading model, the detector type, the signal and noise power, and the prior distributions of the information message. Interested readers are referred to [12]-[14] for the computation of $\nu_i$ under different channel conditions and with different detection methods.

We compare the performance of our proposed channel access/sensing mechanism to the greedy approach that was proposed in [2]. This greedy approach is equivalent to the single-user approach that is defined in [3] and which is applied as a non-cooperative myopic policy in multiuser OSA. In Fig. 4, we observe that RL achieves high utilization of the spectrum opportunities in the primary channels. In particular, in the low-noise regime, the spectrum utilization approaches 100%. Moreover, the RL algorithm has a significant advantage over the greedy algorithm of [2] because the greedy algorithm makes most of the secondary users sense the channel that is most likely to be idle, thus ignoring other possible spectrum opportunities and causing collisions among secondary users, as stated in [3]. This is expected because the greedy algorithm is an optimal myopic strategy for a single-user case and can only be a suboptimal strategy in a multiuser context. On the other hand, a simple random sensing policy that randomly selects a channel at each time instant can outperform the greedy algorithm of [2] as the number of secondary users $K_s$ increases. That is because a random policy reduces the collisions among the secondary users, compared to the greedy policy of [2].

Fig. 6. Collision rates with Primary channels for $K_s = 3$.

Next, we assume all primary channels to have the same crossover probability and we show in Fig. 5 the impact of the sensing noise on the performance of both the Q-learning and random sensing systems. We see that the performance drops at a higher rate when the crossover probability of the sensing BSC ($\nu_i$) becomes greater than the false alarm probability $\alpha$ of each secondary user.


In Fig. 6, we analyze the collision probability that results from our designed NP detectors. Here we are controlling the collision probability with the primary channels during the time slots in which a primary channel is being sensed. Figure 6 shows the accuracy of the proposed decentralized collision probability control in maintaining the collision rate equal to the prescribed threshold $\alpha_0$, using either the RL or the random sensing protocol proposed in this paper. From Fig. 6 it can be seen that these algorithms are robust against channel impairments as captured by $\nu_i$. The efficiency of these algorithms is due to the fact that they estimate the number of secondary users that are sensing each channel and, based on this information, update the channel access rule so that the collision rate with primary users is maintained within the required bound. We observe also that the greedy policy violates the prescribed collision probability with primary users when the observation noise $\nu_i$ is low. However, in this case, the excess in collision probability is not very large compared with $\alpha_0$, because most of the users sense the most likely idle channel, whereas only a small number of users would sense a busy channel according to the greedy approach.

VI.  Conclusion 

In this paper, we derived channel sensing and accessing protocols
for secondary users in decentralized cognitive networks.
The  sensing  policy  is  completely  decentralized  and  is  obtained 
by  using  RL.  The  proposed  policy  ensures  efficient  utilization 
of  the  spectrum  resources  since  it  exploits  the  Markovian 
nature  of  the  primary  channel  traffic  and  limits  the  collisions 
among  competing  secondary  users.  Also,  we  have  designed 
a  secondary  detector  that  maximizes  the  detection  probability 
of  the  idle  channels  while  satisfying  the  collision  probability 
constraint  imposed  by  primary  users.  The  designed  policies 
are  characterized  by  their  robustness  and  accuracy,  and  help 
to  enhance  the  cognitive  capabilities  of  secondary  users. 